Prosper is the country’s first peer-to-peer lending marketplace. The company has provided more than $2,000,000,000 in loans. Loan interest rates range from 5.99% for the most creditworthy borrowers to 36.00% APR for consumers with lower credit ratings. Borrowers can obtain loans from $2,000 up to $35,000.
Financial and social rewards: As a peer-to-peer lending site, Prosper allows individuals and groups to lend money to their peers. Lenders reap both social and financial benefits from lending, as well as greater returns than they would receive from a bank. This encourages people to fund loan requests. Borrowers understand that their loan is funded by their peers, not by a traditional banking institution.
Easy rate retrieval: Customers quickly find out their loan interest rate by providing basic information online. Those borrowers with higher credit scores typically receive lower interest rates.
Simple funding process: Customers quickly find out their loan interest rate by providing basic information online. Those borrowers with higher credit scores typically receive lower interest rates.
No collateral required: Borrowers from Prosper don’t need collateral in order to qualify for a loan. Their funding is based on credit history as well as a few additional criteria.
Reputation: Prosper is America’s first peer-to-peer lending company and has a reputation for being a trustworthy and dependable personal loan company.
Best for: people looking to refinance debt, individuals starting a business, consumers facing financial hardship, and those looking to finance a major life event. Source: https://www.consumeraffairs.com/finance/prosper.html
As a potential data scientist, I will explore the borrower market data to learn about borrower behavior, the demographic segmentation of loans, and Prosper's performance in terms of the volume of listings by year and by area.
Loan Data from Prosper: Last update: 03/11/2014. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
To learn more about the data please visit this variable dictionary which explains the variables in the data set. https://goo.gl/m9hNi4
Analyzing Prosper's loan business to understand the loan market by year, by loan purpose, and by state.
Exploring the relationships between numeric variables and how they affect each other.
Packages required: These packages are required for this EDA. To install them, please execute the following code.
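The installation code itself is not preserved in this write-up, so the following is only a sketch of what it might look like. The package list is an assumption: only corrplot is named later in the text, while ggplot2 and dplyr are guesses at typical EDA dependencies.

```r
# Packages this sketch assumes. Only corrplot is named in the write-up;
# ggplot2 and dplyr are guesses.
required_pkgs <- c("ggplot2", "dplyr", "corrplot")

# requireNamespace() returns FALSE (instead of raising an error)
# when a package is not installed.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1),
                       quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

# Install only what is missing (uncomment to run; needs network access).
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

Checking first avoids reinstalling packages that are already available.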
This data set contains 113,937 loans with 81 variables.
## [1] 113937 81
The data set includes several kinds of variables, such as loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. The variables are of different types. To inspect them, I will use the summary function, a generic function that produces a summary of its argument; applied to a data frame, it reports the quartiles and mean of each numeric column and the most frequent values of each categorical one.
## ListingKey ListingNumber
## 17A93590655669644DB4C06: 6 Min. : 4
## 349D3587495831350F0F648: 4 1st Qu.: 400919
## 47C1359638497431975670B: 4 Median : 600554
## 8474358854651984137201C: 4 Mean : 627886
## DE8535960513435199406CE: 4 3rd Qu.: 892634
## 04C13599434217079754AEE: 3 Max. :1255725
## (Other) :113912
## ListingCreationDate CreditGrade Term
## 2013-10-02 17:20:16.550000000: 6 :84984 Min. :12.00
## 2013-08-28 20:31:41.107000000: 4 C : 5649 1st Qu.:36.00
## 2013-09-08 09:27:44.853000000: 4 D : 5153 Median :36.00
## 2013-12-06 05:43:13.830000000: 4 B : 4389 Mean :40.83
## 2013-12-06 11:44:58.283000000: 4 AA : 3509 3rd Qu.:36.00
## 2013-08-21 07:25:22.360000000: 3 HR : 3508 Max. :60.00
## (Other) :113912 (Other): 6745
## LoanStatus ClosedDate
## Current :56576 :58848
## Completed :38074 2014-03-04 00:00:00: 105
## Chargedoff :11992 2014-02-19 00:00:00: 100
## Defaulted : 5018 2014-02-11 00:00:00: 92
## Past Due (1-15 days) : 806 2012-10-30 00:00:00: 81
## Past Due (31-60 days): 363 2013-02-26 00:00:00: 78
## (Other) : 1108 (Other) :54633
## BorrowerAPR BorrowerRate LenderYield
## Min. :0.00653 Min. :0.0000 Min. :-0.0100
## 1st Qu.:0.15629 1st Qu.:0.1340 1st Qu.: 0.1242
## Median :0.20976 Median :0.1840 Median : 0.1730
## Mean :0.21883 Mean :0.1928 Mean : 0.1827
## 3rd Qu.:0.28381 3rd Qu.:0.2500 3rd Qu.: 0.2400
## Max. :0.51229 Max. :0.4975 Max. : 0.4925
## NA's :25
## EstimatedEffectiveYield EstimatedLoss EstimatedReturn
## Min. :-0.183 Min. :0.005 Min. :-0.183
## 1st Qu.: 0.116 1st Qu.:0.042 1st Qu.: 0.074
## Median : 0.162 Median :0.072 Median : 0.092
## Mean : 0.169 Mean :0.080 Mean : 0.096
## 3rd Qu.: 0.224 3rd Qu.:0.112 3rd Qu.: 0.117
## Max. : 0.320 Max. :0.366 Max. : 0.284
## NA's :29084 NA's :29084 NA's :29084
## ProsperRating..numeric. ProsperRating..Alpha. ProsperScore
## Min. :1.000 :29084 Min. : 1.00
## 1st Qu.:3.000 C :18345 1st Qu.: 4.00
## Median :4.000 B :15581 Median : 6.00
## Mean :4.072 A :14551 Mean : 5.95
## 3rd Qu.:5.000 D :14274 3rd Qu.: 8.00
## Max. :7.000 E : 9795 Max. :11.00
## NA's :29084 (Other):12307 NA's :29084
## ListingCategory..numeric. BorrowerState
## Min. : 0.000 CA :14717
## 1st Qu.: 1.000 TX : 6842
## Median : 1.000 NY : 6729
## Mean : 2.774 FL : 6720
## 3rd Qu.: 3.000 IL : 5921
## Max. :20.000 : 5515
## (Other):67493
## Occupation EmploymentStatus
## Other :28617 Employed :67322
## Professional :13628 Full-time :26355
## Computer Programmer : 4478 Self-employed: 6134
## Executive : 4311 Not available: 5347
## Teacher : 3759 Other : 3806
## Administrative Assistant: 3688 : 2255
## (Other) :55456 (Other) : 2718
## EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## Min. : 0.00 False:56459 False:101218
## 1st Qu.: 26.00 True :57478 True : 12719
## Median : 67.00
## Mean : 96.07
## 3rd Qu.:137.00
## Max. :755.00
## NA's :7625
## GroupKey DateCreditPulled
## :100596 2013-12-23 09:38:12: 6
## 783C3371218786870A73D20: 1140 2013-11-21 09:09:41: 4
## 3D4D3366260257624AB272D: 916 2013-12-06 05:43:16: 4
## 6A3B336601725506917317E: 698 2014-01-14 20:17:49: 4
## FEF83377364176536637E50: 611 2014-02-09 12:14:41: 4
## C9643379247860156A00EC0: 342 2013-09-27 22:04:54: 3
## (Other) : 9634 (Other) :113912
## CreditScoreRangeLower CreditScoreRangeUpper
## Min. : 0.0 Min. : 19.0
## 1st Qu.:660.0 1st Qu.:679.0
## Median :680.0 Median :699.0
## Mean :685.6 Mean :704.6
## 3rd Qu.:720.0 3rd Qu.:739.0
## Max. :880.0 Max. :899.0
## NA's :591 NA's :591
## FirstRecordedCreditLine CurrentCreditLines OpenCreditLines
## : 697 Min. : 0.00 Min. : 0.00
## 1993-12-01 00:00:00: 185 1st Qu.: 7.00 1st Qu.: 6.00
## 1994-11-01 00:00:00: 178 Median :10.00 Median : 9.00
## 1995-11-01 00:00:00: 168 Mean :10.32 Mean : 9.26
## 1990-04-01 00:00:00: 161 3rd Qu.:13.00 3rd Qu.:12.00
## 1995-03-01 00:00:00: 159 Max. :59.00 Max. :54.00
## (Other) :112389 NA's :7604 NA's :7604
## TotalCreditLinespast7years OpenRevolvingAccounts
## Min. : 2.00 Min. : 0.00
## 1st Qu.: 17.00 1st Qu.: 4.00
## Median : 25.00 Median : 6.00
## Mean : 26.75 Mean : 6.97
## 3rd Qu.: 35.00 3rd Qu.: 9.00
## Max. :136.00 Max. :51.00
## NA's :697
## OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries
## Min. : 0.0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 114.0 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 271.0 Median : 1.000 Median : 4.000
## Mean : 398.3 Mean : 1.435 Mean : 5.584
## 3rd Qu.: 525.0 3rd Qu.: 2.000 3rd Qu.: 7.000
## Max. :14985.0 Max. :105.000 Max. :379.000
## NA's :697 NA's :1159
## CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 0.0000 Median : 0.0 Median : 0.000
## Mean : 0.5921 Mean : 984.5 Mean : 4.155
## 3rd Qu.: 0.0000 3rd Qu.: 0.0 3rd Qu.: 3.000
## Max. :83.0000 Max. :463881.0 Max. :99.000
## NA's :697 NA's :7622 NA's :990
## PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
## Min. : 0.0000 Min. : 0.000 Min. : 0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 3121
## Median : 0.0000 Median : 0.000 Median : 8549
## Mean : 0.3126 Mean : 0.015 Mean : 17599
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 19521
## Max. :38.0000 Max. :20.000 Max. :1435667
## NA's :697 NA's :7604 NA's :7604
## BankcardUtilization AvailableBankcardCredit TotalTrades
## Min. :0.000 Min. : 0 Min. : 0.00
## 1st Qu.:0.310 1st Qu.: 880 1st Qu.: 15.00
## Median :0.600 Median : 4100 Median : 22.00
## Mean :0.561 Mean : 11210 Mean : 23.23
## 3rd Qu.:0.840 3rd Qu.: 13180 3rd Qu.: 30.00
## Max. :5.950 Max. :646285 Max. :126.00
## NA's :7604 NA's :7544 NA's :7544
## TradesNeverDelinquent..percentage. TradesOpenedLast6Months
## Min. :0.000 Min. : 0.000
## 1st Qu.:0.820 1st Qu.: 0.000
## Median :0.940 Median : 0.000
## Mean :0.886 Mean : 0.802
## 3rd Qu.:1.000 3rd Qu.: 1.000
## Max. :1.000 Max. :20.000
## NA's :7544 NA's :7544
## DebtToIncomeRatio IncomeRange IncomeVerifiable
## Min. : 0.000 $25,000-49,999:32192 False: 8669
## 1st Qu.: 0.140 $50,000-74,999:31050 True :105268
## Median : 0.220 $100,000+ :17337
## Mean : 0.276 $75,000-99,999:16916
## 3rd Qu.: 0.320 Not displayed : 7741
## Max. :10.010 $1-24,999 : 7274
## NA's :8554 (Other) : 1427
## StatedMonthlyIncome LoanKey TotalProsperLoans
## Min. : 0 CB1B37030986463208432A1: 6 Min. :0.00
## 1st Qu.: 3200 2DEE3698211017519D7333F: 4 1st Qu.:1.00
## Median : 4667 9F4B37043517554537C364C: 4 Median :1.00
## Mean : 5608 D895370150591392337ED6D: 4 Mean :1.42
## 3rd Qu.: 6825 E6FB37073953690388BC56D: 4 3rd Qu.:2.00
## Max. :1750003 0D8F37036734373301ED419: 3 Max. :8.00
## (Other) :113912 NA's :91852
## TotalProsperPaymentsBilled OnTimeProsperPayments
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 9.00 1st Qu.: 9.00
## Median : 16.00 Median : 15.00
## Mean : 22.93 Mean : 22.27
## 3rd Qu.: 33.00 3rd Qu.: 32.00
## Max. :141.00 Max. :141.00
## NA's :91852 NA's :91852
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.00
## Mean : 0.61 Mean : 0.05
## 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :42.00 Max. :21.00
## NA's :91852 NA's :91852
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## Min. : 0 Min. : 0
## 1st Qu.: 3500 1st Qu.: 0
## Median : 6000 Median : 1627
## Mean : 8472 Mean : 2930
## 3rd Qu.:11000 3rd Qu.: 4127
## Max. :72499 Max. :23451
## NA's :91852 NA's :91852
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## Min. :-209.00 Min. : 0.0
## 1st Qu.: -35.00 1st Qu.: 0.0
## Median : -3.00 Median : 0.0
## Mean : -3.22 Mean : 152.8
## 3rd Qu.: 25.00 3rd Qu.: 0.0
## Max. : 286.00 Max. :2704.0
## NA's :95009
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## Min. : 0.00 Min. : 0.0 Min. : 1
## 1st Qu.: 9.00 1st Qu.: 6.0 1st Qu.: 37332
## Median :14.00 Median : 21.0 Median : 68599
## Mean :16.27 Mean : 31.9 Mean : 69444
## 3rd Qu.:22.00 3rd Qu.: 65.0 3rd Qu.:101901
## Max. :44.00 Max. :100.0 Max. :136486
## NA's :96985
## LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## Min. : 1000 2014-01-22 00:00:00: 491 Q4 2013:14450
## 1st Qu.: 4000 2013-11-13 00:00:00: 490 Q1 2014:12172
## Median : 6500 2014-02-19 00:00:00: 439 Q3 2013: 9180
## Mean : 8337 2013-10-16 00:00:00: 434 Q2 2013: 7099
## 3rd Qu.:12000 2014-01-28 00:00:00: 339 Q3 2012: 5632
## Max. :35000 2013-09-24 00:00:00: 316 Q2 2012: 5061
## (Other) :111428 (Other):60343
## MemberKey MonthlyLoanPayment LP_CustomerPayments
## 63CA34120866140639431C9: 9 Min. : 0.0 Min. : -2.35
## 16083364744933457E57FB9: 8 1st Qu.: 131.6 1st Qu.: 1005.76
## 3A2F3380477699707C81385: 8 Median : 217.7 Median : 2583.83
## 4D9C3403302047712AD0CDD: 8 Mean : 272.5 Mean : 4183.08
## 739C338135235294782AE75: 8 3rd Qu.: 371.6 3rd Qu.: 5548.40
## 7E1733653050264822FAA3D: 8 Max. :2251.5 Max. :40702.39
## (Other) :113888
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## Min. : 0.0 Min. : -2.35 Min. :-664.87
## 1st Qu.: 500.9 1st Qu.: 274.87 1st Qu.: -73.18
## Median : 1587.5 Median : 700.84 Median : -34.44
## Mean : 3105.5 Mean : 1077.54 Mean : -54.73
## 3rd Qu.: 4000.0 3rd Qu.: 1458.54 3rd Qu.: -13.92
## Max. :35000.0 Max. :15617.03 Max. : 32.06
##
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## Min. :-9274.75 Min. : -94.2 Min. : -954.5
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.00 Median : 0.0 Median : 0.0
## Mean : -14.24 Mean : 700.4 Mean : 681.4
## 3rd Qu.: 0.00 3rd Qu.: 0.0 3rd Qu.: 0.0
## Max. : 0.00 Max. :25000.0 Max. :25000.0
##
## LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## Min. : 0.00 Min. :0.7000 Min. : 0.00000
## 1st Qu.: 0.00 1st Qu.:1.0000 1st Qu.: 0.00000
## Median : 0.00 Median :1.0000 Median : 0.00000
## Mean : 25.14 Mean :0.9986 Mean : 0.04803
## 3rd Qu.: 0.00 3rd Qu.:1.0000 3rd Qu.: 0.00000
## Max. :21117.90 Max. :1.0125 Max. :39.00000
##
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## Min. : 0.00000 Min. : 0.00 Min. : 1.00
## 1st Qu.: 0.00000 1st Qu.: 0.00 1st Qu.: 2.00
## Median : 0.00000 Median : 0.00 Median : 44.00
## Mean : 0.02346 Mean : 16.55 Mean : 80.48
## 3rd Qu.: 0.00000 3rd Qu.: 0.00 3rd Qu.: 115.00
## Max. :33.00000 Max. :25000.00 Max. :1189.00
##
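The dimension check and the summary above can be reproduced along these lines. This is a minimal sketch on a toy data frame: the three columns are borrowed from the real data set, but the values and the CSV file name in the comment are assumptions.

```r
# Toy stand-in for the real data. In practice this would be something like
#   loans <- read.csv("prosperLoanData.csv")   # hypothetical file name
loans <- data.frame(
  LoanOriginalAmount = c(4000, 6500, 12000, 35000),
  BorrowerRate       = c(0.134, 0.184, 0.250, NA),
  LoanStatus         = factor(c("Current", "Completed", "Current", "Chargedoff"))
)

# Number of rows and columns, the same call that yields "[1] 113937 81"
# on the full data set.
dim(loans)

# Quartiles and mean for numeric columns, level counts for factors,
# and the number of NA values where there are any.
summary(loans)
```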
In this section, I want to find geographic and seasonal behavioral patterns. To do that, I need to create some columns derived from dates.
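One way to derive the date columns is sketched below. The column names ListingCreationYear, ListingCreationMonth, and ListingCreationDay are the ones used later in this report; the sample timestamps and the parsing approach (keeping only the date part of the string) are assumptions.

```r
# Toy rows in the same timestamp format as ListingCreationDate.
loans <- data.frame(
  ListingCreationDate = c("2013-10-02 17:20:16.550000000",
                          "2013-08-28 20:31:41.107000000",
                          "2009-12-06 05:43:13.830000000"),
  stringsAsFactors = FALSE
)

# Keep only the "YYYY-MM-DD" part and parse it as a Date.
# as.character() makes this work even if the column was read as a factor.
created <- as.Date(substr(as.character(loans$ListingCreationDate), 1, 10))

# Year, month (1-12) and day of month (1-31) as integer columns.
loans$ListingCreationYear  <- as.integer(format(created, "%Y"))
loans$ListingCreationMonth <- as.integer(format(created, "%m"))
loans$ListingCreationDay   <- as.integer(format(created, "%d"))
```

With these columns in place, counts per year, month, or day can be obtained with `table()` on the corresponding column.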
As we can observe, loans are spread across the whole country. This could mean that the company has a network of offices distributed around the country, or that it offers a website that is known and accessible everywhere in the country. It could also mean that the company has a good reputation and inspires confidence, so people around the country demand its services.
Even though loans come from all over the country, some states have more loans than others. The table and the graph below show the five states with the highest number of loans. In descending order, these states are: California, Texas, New York, Florida, and Illinois.
| BorrowerState | state.name | LoanOriginalAmount_mean | LoanOriginalAmount_median | n |
|---|---|---|---|---|
| CA | california | 8974.326 | 7000 | 14717 |
| TX | texas | 9087.853 | 7500 | 6842 |
| NY | new york | 8833.034 | 7000 | 6729 |
| FL | florida | 8207.461 | 6500 | 6720 |
| IL | illinois | 8395.931 | 6500 | 5921 |
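A table of this shape can be built with a per-state aggregation. The sketch below uses base R and toy values so that it runs without extra packages; it is not necessarily how the table above was produced, and only the column names are taken from it.

```r
# Toy loans: three in CA, two in TX, one in NY.
loans <- data.frame(
  BorrowerState      = c("CA", "CA", "CA", "TX", "TX", "NY"),
  LoanOriginalAmount = c(5000, 7000, 9000, 7500, 10000, 7000)
)

# One row per state: mean, median and number of loans.
by_state <- do.call(rbind, lapply(split(loans, loans$BorrowerState), function(d) {
  data.frame(
    BorrowerState             = d$BorrowerState[1],
    LoanOriginalAmount_mean   = mean(d$LoanOriginalAmount),
    LoanOriginalAmount_median = median(d$LoanOriginalAmount),
    n                         = nrow(d)
  )
}))

# Top five states by number of loans ...
top_by_count <- head(by_state[order(-by_state$n), ], 5)
# ... and by mean loan amount.
top_by_mean  <- head(by_state[order(-by_state$LoanOriginalAmount_mean), ], 5)
```

Sorting the same summary by `n` or by `LoanOriginalAmount_mean` gives two different rankings, which is why the state with the most loans need not be the state with the largest loans.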
Although the number of loans is important, the amount of money involved in these loans matters more. So, let’s identify which states borrow the most per loan. Ranked by mean original loan amount, the five states in the table below are, in descending order: Texas, California, New York, Illinois, and Florida.
| BorrowerState | state.name | LoanOriginalAmount_mean | LoanOriginalAmount_median | n |
|---|---|---|---|---|
| TX | texas | 9087.853 | 7500 | 6842 |
| CA | california | 8974.326 | 7000 | 14717 |
| NY | new york | 8833.034 | 7000 | 6729 |
| IL | illinois | 8395.931 | 6500 | 5921 |
| FL | florida | 8207.461 | 6500 | 6720 |
Now that I have an idea of how loans are segmented geographically, I want to determine how lending evolved over time. In the first graph below, I can observe that the number of loans was very low in two years (2006 and 2009). The first (2006) was probably the company's first year of operation, and in the second there was probably a problem in the economy. In the other years, the number of loans grew, probably because of the economic recovery. The other two graphs show that the amount of money requested per loan stayed roughly constant.
Even though some years have more loans than others (2013 in particular), I cannot observe a seasonal pattern by month within a year. This suggests that people do not borrow for a specific seasonal reason such as the holidays.
As with the months, no clear seasonal pattern appears across the days of each month.
To complete this first analysis, I want to discover the main reason why people take out a loan. To do that, I will analyze the category that the borrower selected when posting the listing. Because this variable is numeric, here is the meaning of each number: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7 - Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
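The numeric codes can be turned into readable labels with a factor, which makes tables and plot axes easier to interpret. The labels below are copied from the list above; the sample codes are toy values.

```r
# Labels for codes 0 to 20, in order.
category_labels <- c(
  "Not Available", "Debt Consolidation", "Home Improvement", "Business",
  "Personal Loan", "Student Use", "Auto", "Other", "Baby&Adoption", "Boat",
  "Cosmetic Procedure", "Engagement Ring", "Green Loans", "Household Expenses",
  "Large Purchases", "Medical/Dental", "Motorcycle", "RV", "Taxes",
  "Vacation", "Wedding Loans"
)

# Toy values standing in for the ListingCategory..numeric. column.
listing_category <- c(1, 0, 7, 1, 20, 3)

# Map each code to its label; level 0 gets the first label, 20 the last.
category <- factor(listing_category, levels = 0:20, labels = category_labels)

# Count loans per category, dropping categories that do not occur.
counts <- sort(table(droplevels(category)), decreasing = TRUE)
counts
```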
In the first graph, we can observe that the most common reason for a loan over the years is category 0 (Not Available); however, if we limit the y axis to the 0.95 quantile, the dominant columns are 0 (Not Available) and 7 (Other). The second graph, where I also split the data by Term, confirms this pattern. Clearly these results do not give much information, and the data set could be improved for this variable. Considering the next result, the most important categories are 2 (Home Improvement) and 3 (Business).
Finally, what we can say about the categories is that the amount of data is fairly even across most of them.
This data set contains 113,937 loans with 81 variables. The data set includes several kinds of variables, such as loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. The variables are of different types, which I inspected with the summary function.
The most important features are:
The supporting features are:
To analyze how loans behave over time, I created three columns: ListingCreationYear, ListingCreationMonth, and ListingCreationDay.
No feature has an unusual distribution.
In this bivariate analysis, I want to check some variable relationships that I expect to hold based on my ideas about the loan business. I think the following relationships are natural in this business.
LoanOriginalAmount vs MonthlyLoanPayment -> Higher loan amount, higher monthly payment.
LoanOriginalAmount vs Investors -> Higher loan amount, higher number of investors.
EstimatedReturn vs Investors -> Higher estimated return, higher investors.
LoanOriginalAmount vs EmploymentStatusDuration -> Higher loan amount, higher employment status duration.
BankcardUtilization vs LoanOriginalAmount -> Higher bankcard utilization, higher loan amount.
The correlation between these two variables is strong and positive: a higher loan amount goes with a higher monthly payment.
##
## Pearson's product-moment correlation
##
## data: LoanOriginalAmount and MonthlyLoanPayment
## t = 831.75, df = 108040, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9292039 0.9308148
## sample estimates:
## cor
## 0.9300138
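Output of this form comes from `cor.test()`. The sketch below runs it on simulated data, since the real columns are not included here; the toy relationship between amount and payment is an assumption chosen only to produce a strong positive correlation.

```r
set.seed(42)

# Toy loan amounts between 1,000 and 35,000 dollars.
loan_amount <- runif(200, min = 1000, max = 35000)

# Toy monthly payment: roughly proportional to the amount, plus noise.
monthly_payment <- loan_amount * 0.033 + rnorm(200, sd = 60)

# Pearson's product-moment correlation test.
result <- cor.test(loan_amount, monthly_payment, method = "pearson")

result$estimate   # correlation coefficient ("cor" in the printed output)
result$conf.int   # 95 percent confidence interval
result$p.value    # p-value for the null hypothesis of zero correlation
```

With a sample as large as the Prosper data, the p-value is tiny even for weak correlations, so the size of the coefficient is the more informative number.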
Although a correlation exists between these two variables, it is not strong (about 0.37). A higher loan amount does not reliably imply a higher number of investors.
##
## Pearson's product-moment correlation
##
## data: LoanOriginalAmount and Investors
## t = 131.61, df = 108040, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3665643 0.3768423
## sample estimates:
## cor
## 0.3717147
There is practically no linear correlation between these two variables: the coefficient (about -0.09) is very close to 0.
##
## Pearson's product-moment correlation
##
## data: EstimatedReturn and Investors
## t = -26.797, df = 84523, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09846460 -0.08509518
## sample estimates:
## cor
## -0.09178403
There is practically no linear correlation between these two variables: the coefficient (about 0.10) is very close to 0.
##
## Pearson's product-moment correlation
##
## data: LoanOriginalAmount and EmploymentStatusDuration
## t = 31.168, df = 104190, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.09009385 0.10212579
## sample estimates:
## cor
## 0.09611333
There is practically no linear correlation between these two variables: the coefficient (about -0.03) is very close to 0.
##
## Pearson's product-moment correlation
##
## data: LoanOriginalAmount and BankcardUtilization
## t = -11.102, df = 104210, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.04043413 -0.02830564
## sample estimates:
## cor
## -0.03437115
Because only one scatter plot shows a strong correlation, it is important to define another strategy to identify variable relationships. It is also worth noting that people can hold ideas about a business that are not really true because they lack knowledge of the field. For that reason, it is important to be objective when we analyze data.
To improve the variable selection, I am going to calculate the correlation between all numeric variables. To do this, I will reduce the data set to its numeric columns and then rename the columns.
Next, I will drop NA values, and finally I will calculate the correlations and plot them so they are easy to read.
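The steps just described can be sketched as follows. The data frame is a toy example with a handful of rows, and the commented-out plotting call assumes the corrplot package (mentioned later in this report) is installed.

```r
# Toy data: three numeric columns, one text column, one missing value.
loans <- data.frame(
  LoanOriginalAmount = c(4000, 6500, 12000, 35000, 8000, NA),
  MonthlyLoanPayment = c(131.6, 217.7, 371.6, 1100.0, 272.5, 300.0),
  Investors          = c(2, 44, 115, 300, 80, 10),
  LoanStatus         = c("Current", "Completed", "Current",
                         "Current", "Defaulted", "Current")
)

# 1. Keep only the numeric columns.
numeric_cols <- loans[, vapply(loans, is.numeric, logical(1)), drop = FALSE]

# 2. Drop every row that has at least one NA.
complete <- na.omit(numeric_cols)

# 3. Pairwise Pearson correlations between all remaining columns.
corr_matrix <- cor(complete)
round(corr_matrix, 3)

# 4. Optional plot, if corrplot is installed:
# corrplot::corrplot(corr_matrix, method = "circle")
```

Note that dropping every row with any NA can remove a large share of a data set with many sparsely filled columns, so coefficients computed this way may differ slightly from those computed pair by pair.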
As a result, the variable pairs with a good correlation coefficient are:
Accordingly, the new variables chosen for the correlation analysis are:
The correlation between these two variables is strong and positive: a higher loan amount goes with a higher monthly payment.
##
## Pearson's product-moment correlation
##
## data: LoanOriginalAmount and MonthlyLoanPayment
## t = 831.75, df = 108040, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9292039 0.9308148
## sample estimates:
## cor
## 0.9300138
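The correlation between these two variables is moderate and negative (about -0.47): a higher loan amount goes with a lower LP_ServiceFees value. Because LP_ServiceFees is mostly negative in this data set, a lower value corresponds to a larger fee in absolute terms.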
##
## Pearson's product-moment correlation
##
## data: LoanOriginalAmount and LP_ServiceFees
## t = -176.49, df = 108040, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4776713 -0.4684142
## sample estimates:
## cor
## -0.4730558
The correlation between these two variables is strong and negative: a higher ProsperRating..numeric. goes with a lower EstimatedLoss.
##
## Pearson's product-moment correlation
##
## data: ProsperRating..numeric. and EstimatedLoss
## t = -1057.6, df = 84523, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9647015 -0.9637542
## sample estimates:
## cor
## -0.9642309
The list below shows the initial relationships I thought were natural. However, most of these variables are not correlated with each other.
LoanOriginalAmount vs MonthlyLoanPayment -> Higher loan amount, higher monthly payment.
LoanOriginalAmount vs Investors -> Higher loan amount, higher number of investors.
EstimatedReturn vs Investors -> Higher estimated return, higher investors.
LoanOriginalAmount vs EmploymentStatusDuration -> Higher loan amount, higher employment status duration.
BankcardUtilization vs LoanOriginalAmount -> Higher bankcard utilization, higher loan amount.
To identify more useful relationships, I calculated the correlation coefficient between all variables and found the following list:
The two strongest relationships I found are: 1) CreditScoreRangeUpper - CreditScoreRangeLower -> 1, and 2) MonthlyLoanPayment - LoanOriginalAmount -> 0.9319837.
In this section, I want to show how a third, supporting variable affects the relationships I found in the bivariate analysis. In the first image, we can observe that most MonthlyLoanPayment values belong to 36- and 60-month terms.
We can also see that most borrowers are employed, and a significant number of them have full-time jobs.
Next, we can see that a large number of loans were completed; however, a significant number are still open.
It is also worth mentioning that most of the loans are rated A, AA, or B, which means that the company's portfolio does not carry high risk.
Finally, we can observe that the main income range of people who apply for a loan is about $25,000 to $75,000.
In the following graph, the mean MonthlyLoanPayment increases with EmploymentStatusDuration; however, the term does not depend on MonthlyLoanPayment or EmploymentStatusDuration.
In addition, in the following graph the mean MonthlyLoanPayment increases in two listing categories while staying constant in the others. One category also has a very low value. Again, the term does not depend on MonthlyLoanPayment or the listing category.
In the first image, we can observe that most MonthlyLoanPayment values belong to 36- and 60-month terms. We can also see that most borrowers are employed, and a significant number of them have full-time jobs. Next, a large number of loans were completed; however, a significant number are still open. It is also worth mentioning that most of the loans are rated A, AA, or B, which means that the company's portfolio does not carry high risk. Finally, the main income range of people who apply for a loan is about $25,000 to $75,000.
I can also observe that the mean MonthlyLoanPayment increases with EmploymentStatusDuration; however, the term does not depend on MonthlyLoanPayment or EmploymentStatusDuration.
These two maps show that more loans does not always mean more money requested. For example, in the first map the state with the most loans is California (14717 loans), whereas the state with the highest mean loan amount in the table is Texas (9087.853 dollars). These graphs are very useful because they show very clearly the difference between the number of loans and the amount of money requested. This result could also help the company decide what kind of marketing strategy to apply. For example, if the company wants more customers it could develop strategies for California, but if it wants larger loans it is probably a better option to focus on Texas.
In this graph, the Prosper rating decreases as the estimated loss increases. This means that the risk is higher when borrowers have low Prosper ratings. The correlation coefficient between these two variables (-0.9641819) supports this relationship. Moreover, this graph is helpful because it provides a useful way to visualise the range and other characteristics of responses for a large group.
This graph shows that the monthly payment is higher when the repayment term is shorter: it is highest when the term is 12 months, regardless of the loan category. The monthly payment is also relatively high when the term is 60 months. What matters in this graph is how easily it shows the way three variables interact with each other.
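The link between a shorter term and a higher monthly payment follows from the standard formula for a fixed-rate, fully amortizing loan. The sketch below applies that formula to an illustrative principal and rate. It is an assumption that Prosper's MonthlyLoanPayment is computed this way, and the function name and example values are hypothetical.

```r
# Monthly payment of a fixed-rate, fully amortizing loan:
#   payment = P * r / (1 - (1 + r)^(-n))
# where P is the principal, r the monthly rate and n the number of months.
monthly_payment <- function(principal, annual_rate, term_months) {
  r <- annual_rate / 12
  principal * r / (1 - (1 + r)^(-term_months))
}

# Same principal and rate, three terms that appear in the data set.
terms    <- c(12, 36, 60)
payments <- monthly_payment(principal = 10000, annual_rate = 0.18,
                            term_months = terms)

data.frame(term_months = terms, monthly_payment = round(payments, 2))
```

For a fixed principal and rate, the payment falls as the term grows. The relatively high payments seen at 60 months in the graph would therefore have to come from other factors, such as larger loan amounts at that term, which this sketch does not model.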
To do a good analysis, it is important to know the business, understand how the process works, and know what each variable means. If a person does not have much experience in the field, the time needed to analyze the data can grow considerably, the analysis may be imprecise, and the results may be wrong. In my case, I had trouble selecting the variables and building logical relationships between them. To fix this, I read about the business and used tools to help with my analysis. One of these tools was corrplot, which helped me identify relationships in less time.
After this work, I hope to gain more experience using R, not just for EDA but also for prediction. I want to learn how to apply machine learning algorithms with R. I also want to analyze different data sets to build more skill with R.